The data set contains information on red wines and the chemical properties that influence their quality. I selected this data set out of curiosity about red wines and wanted to see what insights could be gleaned. I found the following definitions and attributes in another Udacity wine project, which is listed in the resources section. The attribute definitions were helpful when looking into the different variables.
head(wineData, 10)
str(wineData)
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
sum(is.na(wineData))
## [1] 0
I displayed the first 10 rows with head() and ran the str() function to view a compact summary and understand the overall data structure. Finally, I ran sum(is.na(wineData)) to verify that the data set has no missing values.
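The single total from sum(is.na()) does not say where missing values are located. As a minimal sketch, assuming a small made-up data frame named toy in place of wineData, colSums(is.na()) breaks the count down by column:

```r
# Toy data frame standing in for wineData (hypothetical values).
toy <- data.frame(
  alcohol = c(9.4, NA, 9.8, 10.5),
  pH      = c(3.51, 3.20, NA, NA),
  quality = c(5L, 5L, 6L, 7L)
)

# Total number of missing cells, as in the check above.
total_missing <- sum(is.na(toy))

# Missing cells broken down by column.
missing_by_column <- colSums(is.na(toy))

print(total_missing)
print(missing_by_column)
```

On the real data both results would be all zeros, since the total is already 0.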
ggplot(data = wineData) +
geom_histogram(mapping = aes(x=fixed.acidity), binwidth = 0.5) +
scale_x_continuous(breaks = seq(0,17,1)) +
xlab("Fixed Acidity") +
ylab("Count")
summary(wineData$fixed.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The histogram is skewed right. The most common fixed acidity values fall between approximately 6.5 \(g/dm^3\) and 7.5 \(g/dm^3\). The mean (8.32) is larger than the median (7.90) because the long right tail pulls the mean upward, and both the tail and the gap between mean and median are indicators of a right-skewed distribution.
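The direction of skew can also be checked numerically rather than by eye. The sketch below assumes a hypothetical helper, sample_skewness(), and a small made-up vector; neither is part of the original analysis. It compares the mean with the median and computes a moment-based skewness, which is positive when the right tail is longer.

```r
# Moment-based sample skewness: positive values indicate a longer right tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Toy right-skewed sample (hypothetical values, not taken from wineData).
toy <- c(6.5, 6.8, 7.0, 7.1, 7.2, 7.4, 7.9, 8.3, 9.2, 11.5, 15.9)

mean(toy) > median(toy)   # TRUE when the right tail pulls the mean upward
sample_skewness(toy)      # positive for a right-skewed sample
```

The same two checks could be applied to any column, for example sample_skewness(wineData$fixed.acidity).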
ggplot(data = wineData) +
geom_histogram(mapping = aes(x=volatile.acidity), binwidth = 0.04) +
scale_x_continuous(breaks = seq(0,1.6,0.1)) +
xlab("Volatile Acidity") +
ylab("Count")
summary(wineData$volatile.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The histogram is skewed right. The volatile acidity in most red wines is approximately between 0.35 \(g/dm^3\) and 0.65 \(g/dm^3\). The mean (≈ 0.53) is slightly above the median (0.52), and the tail extends to the right. There might be some outliers around 1.3 \(g/dm^3\) and 1.55 \(g/dm^3\).
ggplot(data = wineData) +
geom_histogram(mapping = aes(x=citric.acid), binwidth = 0.03) +
scale_x_continuous(breaks = seq(0,1,0.125)) +
xlab("Citric Acid") +
ylab("Count")
summary(wineData$citric.acid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
The histogram is skewed right. The most common citric acid value is approximately 0.0 \(g/dm^3\). The mean (0.27) is slightly above the median (0.26), pulled upward by the right tail.
ggplot(data = wineData) +
geom_histogram(mapping = aes(x=residual.sugar), binwidth = 0.2) +
xlim(1,7) +
xlab("Residual Sugar") +
ylab("Count")
## Warning: Removed 31 rows containing non-finite values (stat_bin).
summary(wineData$residual.sugar)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The histogram is skewed right. The residual sugar in most red wines is approximately 2.0 \(g/dm^3\). The mean (2.539) is above the median (2.2), pulled upward by the long right tail. I also used xlim() to leave outliers out of the plot and create a cleaner visual; the summary statistics still use all rows.
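The warning above appears because xlim() discards rows outside the limits before the bins are computed. As a hedged alternative sketch, using a made-up data frame named toy rather than wineData, coord_cartesian() zooms the view without removing any rows:

```r
library(ggplot2)

# Toy data standing in for wineData$residual.sugar (hypothetical values).
toy <- data.frame(residual.sugar = c(0.9, 1.8, 1.9, 2.0, 2.2, 2.6, 3.1, 8.3, 15.5))

# Rows that fall outside the 1 to 7 range used with xlim() above.
dropped <- sum(toy$residual.sugar < 1 | toy$residual.sugar > 7)

# coord_cartesian() zooms the view without discarding any rows.
zoomed <- ggplot(data = toy) +
  geom_histogram(mapping = aes(x = residual.sugar), binwidth = 0.2) +
  coord_cartesian(xlim = c(1, 7)) +
  xlab("Residual Sugar") +
  ylab("Count")

print(dropped)
```

Either approach gives a cleaner picture; the difference matters mainly when a layer computes statistics, such as bins or a fitted line, from the rows that remain.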
ggplot(data = wineData) +
geom_histogram(mapping = aes(x=chlorides), binwidth = 0.01) +
xlim(0,0.2) +
xlab("Chlorides") +
ylab("Count")
## Warning: Removed 41 rows containing non-finite values (stat_bin).
summary(wineData$chlorides)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The histogram is skewed right. The amount of chlorides found in most red wines is approximately 0.078 \(g/dm^3\). The mean (≈ 0.087) is above the median (0.079), pulled upward by the right tail. I also used xlim() to leave outliers out of the plot and create a cleaner visual.
ggplot(data = wineData) +
geom_histogram(mapping = aes(x=free.sulfur.dioxide), binwidth = 2) +
scale_x_continuous(breaks = seq(0,70,5)) +
xlab("Free Sulfur Dioxide") +
ylab("Count")
summary(wineData$free.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
The histogram is skewed right. The most common free sulfur dioxide value is approximately 6.0 \(mg/dm^3\). The mean (15.87) is above the median (14.0), pulled upward by the right tail.
ggplot(data = wineData) +
geom_histogram(mapping = aes(x=total.sulfur.dioxide), binwidth = 5) +
xlim(0,175) +
xlab("Total Sulfur Dioxide") +
ylab("Count")
## Warning: Removed 2 rows containing non-finite values (stat_bin).
summary(wineData$total.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
The histogram is skewed right. The most common amount of total sulfur dioxide is approximately 25.0 \(mg/dm^3\). The mean (46.47) is well above the median (38.0), pulled upward by the long right tail. I also used xlim() to leave outliers out of the plot and create a cleaner visual.
ggplot(data = wineData) +
geom_histogram(mapping = aes(x=density), binwidth = 0.0005) +
scale_x_continuous(breaks = seq(0.9,1.05,0.0025)) +
xlab("Density") +
ylab("Count")
summary(wineData$density)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
The histogram is roughly symmetric rather than skewed. The density of most red wines is approximately 0.997 \(g/cm^3\). The median (0.9968) and mean (0.9967) are nearly identical, which is what a symmetric distribution would show.
ggplot(data = wineData) +
geom_histogram(mapping = aes(x=pH), binwidth = 0.05) +
scale_x_continuous(breaks = seq(0,4,0.1)) +
xlab("pH") +
ylab("Count")
summary(wineData$pH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The histogram is roughly symmetric rather than skewed. The pH of most red wines is approximately 3.3. The median (3.31) and mean (3.311) are nearly identical, which is what a symmetric distribution would show.
ggplot(data = wineData) +
geom_histogram(mapping = aes(x=sulphates), bins = 30) +
xlim(0.25, 1.5) +
xlab("Sulphates") +
ylab("Count")
## Warning: Removed 8 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
summary(wineData$sulphates)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The histogram is skewed right. The amount of sulphates found in most red wines is approximately 0.55 \(g/dm^3\). The mean (≈ 0.66) is above the median (0.62), pulled upward by the right tail. I also used xlim() to leave outliers out of the plot and create a cleaner visual.
ggplot(data = wineData) +
geom_histogram(mapping = aes(x=alcohol), binwidth = 0.5) +
scale_x_continuous(breaks = seq(8,15,1)) +
xlab("Alcohol") +
ylab("Count")
summary(wineData$alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The histogram is skewed right. The most common alcohol content is approximately 9.5\(\%\) by volume. The mean (10.42) is above the median (10.2), pulled upward by the right tail. For this plot I only set the axis breaks with scale_x_continuous(), so no rows were removed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
Red wine quality follows an approximately normal distribution. Most wines are rated at a quality level of 5. Looking at other measures of central tendency, the 1st quartile, median, 3rd quartile, and mean values are 5.0, 6.0, 6.0, and 5.636 respectively.
The data set contains 12 variables and 1,599 rows.
I am interested in looking at which chemical properties drive quality ratings.
I would also like to investigate any correlations found between acidity levels & pH, sulphates & density, or residual sugars & density.
No.
No, I did not perform any operations on the data set. However, I did reduce the xlim() maximum on a few graphs to clean up the visualizations. Regarding the quality graph, I had never seen a 2nd quartile (median) and 3rd quartile share the same value. I found this interesting and would like to dig into it further. A likely explanation is that quality takes only a few integer values, so many observations are tied.
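Equal quartiles are common when a variable takes only a handful of integer values. The sketch below uses made-up rating counts, not the real quality column, to show how the median and 3rd quartile can land on the same value:

```r
# Toy ratings (hypothetical counts, not the real wineData$quality).
ratings <- rep(3:8, times = c(1, 5, 40, 40, 12, 2))

# Counts per rating: most observations are tied at 5 or 6.
print(table(ratings))

# With so many ties, neighbouring quantiles can land on the same integer.
q <- quantile(ratings, probs = c(0.25, 0.50, 0.75))
print(q)
print(mean(ratings))
```

Running table(wineData$quality) would show whether the real ratings are concentrated in the same way.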
As stated in the previous section, I would like to investigate correlations between alcohol & quality, acidity levels & pH, sulphates & density, and residual sugars & density. To start out, I’m going to use corrplot \(^2\) and other correlation matrices to determine which features have the strongest relationships.
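# Assumes corrplot and RColorBrewer were loaded earlier with library()
wineCorrelation <- cor(wineData)
# Test significance on the raw data, not on the correlation matrix
significance <- cor.mtest(wineData, conf.level = .95)
corrplot(wineCorrelation, method = "number",
         p.mat = significance$p, sig.level = .05, insig = "blank",
         order = "hclust", tl.col = "black", type = "upper", tl.srt = 45,
         col = brewer.pal(n = 10, name = "RdYlBu"))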
In the figure above, correlations with a p-value > 0.05 are treated as insignificant, and their coefficients are left blank. The term insignificant comes from corrplot, but I believe the terminology is slightly misleading. A blank cell does not mean the relationship is unimportant or absent. It means the observed correlation was close enough to zero that the test could not reject the null hypothesis of no correlation at the 5\(\%\) significance level.
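The same kind of test can be run for a single pair of variables with cor.test() from base R. This is a minimal sketch on made-up vectors x and y, not on wineData:

```r
# Toy vectors (hypothetical values): y rises with x, with alternating noise.
x <- 1:50
y <- x + rep(c(-3, 3), times = 25)

# cor.test() tests the null hypothesis that the true correlation is zero.
result <- cor.test(x, y, conf.level = 0.95)

result$estimate   # sample correlation coefficient r
result$p.value    # p-value of the test
result$conf.int   # 95% confidence interval for r

# With sig.level = .05, a p-value above 0.05 is what leaves a cell blank.
is_significant <- result$p.value <= 0.05
```

For the wine data, the equivalent call would be something like cor.test(wineData$alcohol, wineData$quality).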
head(round(wineCorrelation, 4))
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.0000 -0.2561 0.6717
## volatile.acidity -0.2561 1.0000 -0.5525
## citric.acid 0.6717 -0.5525 1.0000
## residual.sugar 0.1148 0.0019 0.1436
## chlorides 0.0937 0.0613 0.2038
## free.sulfur.dioxide -0.1538 -0.0105 -0.0610
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.1148 0.0937 -0.1538
## volatile.acidity 0.0019 0.0613 -0.0105
## citric.acid 0.1436 0.2038 -0.0610
## residual.sugar 1.0000 0.0556 0.1870
## chlorides 0.0556 1.0000 0.0056
## free.sulfur.dioxide 0.1870 0.0056 1.0000
## total.sulfur.dioxide density pH sulphates alcohol
## fixed.acidity -0.1132 0.6680 -0.6830 0.1830 -0.0617
## volatile.acidity 0.0765 0.0220 0.2349 -0.2610 -0.2023
## citric.acid 0.0355 0.3649 -0.5419 0.3128 0.1099
## residual.sugar 0.2030 0.3553 -0.0857 0.0055 0.0421
## chlorides 0.0474 0.2006 -0.2650 0.3713 -0.2211
## free.sulfur.dioxide 0.6677 -0.0219 0.0704 0.0517 -0.0694
## quality
## fixed.acidity 0.1241
## volatile.acidity -0.3906
## citric.acid 0.2264
## residual.sugar 0.0137
## chlorides -0.1289
## free.sulfur.dioxide -0.0507
I used round() to make the correlation matrix more readable. Because of head(), only the first six rows are shown.
set.seed(1599)
wine_subset <- wineData[,c(1:12)]
names(wine_subset)
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
ggpairs(wine_subset[sample.int(nrow(wine_subset), 1000),])
Simple scatter plot matrix\(^3\).
Now that I have looked at various matrices, I’ve determined which relationships I want to look at. Some items of interest are alcohol & quality, and fixed acidity & pH.
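Picking the strongest relationships out of a 12 x 12 matrix by eye is error-prone. The sketch below assumes a hypothetical helper named strongest_pairs() and a made-up data frame; it lists each variable pair once, sorted by absolute correlation.

```r
# Return the n variable pairs with the largest absolute correlation.
strongest_pairs <- function(df, n = 5) {
  cm <- cor(df)
  idx <- which(upper.tri(cm), arr.ind = TRUE)   # each pair once, no diagonal
  pairs <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx]
  )
  pairs <- pairs[order(-abs(pairs$r)), ]
  head(pairs, n)
}

# Toy data frame (hypothetical values) standing in for wineData.
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(12, 10, 9, 7, 4, 2),   # falls steadily as a rises
  c = c(3, 1, 4, 1, 5, 9)      # loosely related to a
)

top <- strongest_pairs(toy, n = 2)
print(top)
```

Calling strongest_pairs(wineData) would give a ranked list to compare against the matrices above.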
ggplot(data = wineData) +
geom_bar(
mapping = aes(x = round(alcohol), fill = factor(quality)),
position = "dodge"
)
ggplot(data = wineData, mapping = aes(x=alcohol, y=quality)) +
geom_point(mapping = aes(color = factor(quality))) +
geom_smooth(color='red',method="lm")
round(cor(wineData$alcohol, wineData$quality), 2)
## [1] 0.48
I created a grouped bar chart and a scatter plot to look at alcohol and quality. There is a moderate positive correlation of approximately 0.48 between alcohol and quality.
round(cor(wineData$quality, wineData$total.sulfur.dioxide), 2)
## [1] -0.19
I created a jittered scatter plot to view the relationship between quality and total sulfur dioxide. The data looks like it is normally distributed. It has a weak negative correlation of approximately -0.19.
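The code for this jittered plot is not shown in the output above. The following is only a sketch of how such a plot might be built, using a made-up data frame named toy; the jitter width, transparency, and axis labels are assumptions rather than the original settings.

```r
library(ggplot2)

# Toy data (hypothetical values) standing in for wineData.
toy <- data.frame(
  quality = c(5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L),
  total.sulfur.dioxide = c(34, 67, 54, 60, 40, 59, 21, 18)
)

# position is an argument of the layer, not an aesthetic inside aes().
jittered <- ggplot(data = toy, mapping = aes(x = quality, y = total.sulfur.dioxide)) +
  geom_point(position = position_jitter(width = 0.2, height = 0), alpha = 0.5) +
  xlab("Quality") +
  ylab("Total Sulfur Dioxide")
```

Jittering only horizontally (height = 0) keeps the sulfur dioxide values exact while spreading out the integer quality ratings.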
ggplot(data = wineData, mapping = aes(x=alcohol, y=density)) +
geom_point(aes(color="Density"), position = "jitter") +
geom_smooth(method="lm")
round(cor(wineData$alcohol, wineData$density), 2)
## [1] -0.5
This scatterplot looks at the relationship between alcohol and density. As alcohol increases, density decreases. This inverse relationship has a moderate correlation of approximately -0.5.
ggplot(data = wineData, mapping = aes(x=fixed.acidity, y=pH)) +
geom_point(position = "jitter", color="blue") +
geom_smooth(color='red',method="lm")
round(cor(wineData$fixed.acidity, wineData$pH), 2)
## [1] -0.68
Fixed acidity and pH also have an inverse relationship. As fixed acidity increases, pH decreases. The relationship has a strong correlation of approximately -0.68.
ggplot(data = wineData, mapping = aes(x = citric.acid, y = pH)) +
geom_point() +
geom_smooth(color='hotpink',method="lm")
round(cor(wineData$citric.acid, wineData$pH), 2)
## [1] -0.54
As citric acid increases, pH decreases. This is an inverse relationship, with a moderate correlation of approximately -0.54. This is similar to the previous graph.
ggplot(data = wineData, mapping = aes(x = citric.acid, y = volatile.acidity)) +
geom_point() +
geom_smooth(color='orange',method="lm")
round(cor(wineData$citric.acid, wineData$volatile.acidity), 2)
## [1] -0.55
As citric acid increases, volatile acidity decreases. This is an inverse relationship, with a moderate correlation of approximately -0.55.
ggplot(data = wineData, mapping = aes(y=residual.sugar, x=alcohol)) +
geom_point() +
geom_smooth(color='blue',method="lm")
round(cor(wineData$alcohol, wineData$residual.sugar), 2)
## [1] 0.04
The percent of alcohol and residual sugar have a very weak correlation of approximately 0.04.
ggplot(data = wineData, mapping = aes(x=residual.sugar, y=density)) +
geom_point() +
geom_smooth(method = "lm")
round(cor(wineData$residual.sugar, wineData$density), 2)
## [1] 0.36
As residual sugar increases, density tends to increase. The correlation of this relationship is approximately 0.36, which is considered moderate.
ggplot(data = wineData, mapping = aes(x=fixed.acidity, y = density)) +
geom_point(aes(color="Density"), position = "jitter") +
geom_smooth(method = "lm")
round(cor(wineData$fixed.acidity, wineData$density), 2)
## [1] 0.67
There is a direct relationship between fixed acidity and density. As fixed acidity increases, density increases. The relationship has a strong correlation of approximately 0.67.
ggplot(data = wineData, mapping = aes(x=fixed.acidity, y=citric.acid)) +
geom_point() +
geom_smooth(method = "lm")
round(cor(wineData$fixed.acidity, wineData$citric.acid), 2)
## [1] 0.67
Fixed acidity and citric acid also have a direct relationship. As fixed acidity increases, citric acid increases. The relationship has a strong correlation of approximately 0.67. One question I have is whether citric acid also falls into the category of fixed acids.
ggplot(data = wineData, mapping = aes(x=chlorides, y=citric.acid)) +
geom_point() +
geom_smooth(method = "lm")
round(cor(wineData$chlorides, wineData$citric.acid), 2)
## [1] 0.2
There is a positive correlation between chlorides and citric acid. It is a weak correlation of approximately 0.2.
ggplot(data = wineData, mapping = aes(x=free.sulfur.dioxide, y=total.sulfur.dioxide)) +
geom_point(position = "jitter") +
geom_smooth(method = "lm")
round(cor(wineData$free.sulfur.dioxide, wineData$total.sulfur.dioxide), 2)
## [1] 0.67
There is a direct relationship between free sulfur dioxide and total sulfur dioxide. As free sulfur dioxide increases, total sulfur dioxide increases. The relationship has a strong correlation of approximately 0.67. This makes sense since free sulfur dioxide is the free form of \(SO_2\) and is included in total sulfur dioxide.
I was most interested in alcohol and quality. My initial assumption was there would be a direct relationship between alcohol and quality. Although there was a moderate positive correlation (≈ 0.48), I thought the correlation would be stronger.
The correlation between fixed acidity and pH is approximately -0.68. As fixed acidity increases, pH decreases. This makes sense considering the pH scale: substances are considered more acidic as their pH approaches zero. I found it interesting that the pH range of the wines in this study is comparable to the pH of orange juice, soda, and acid rain.
I also thought it was interesting that there is not a strong relationship between percent of alcohol and residual sugar. My initial thought was that wine with higher alcohol content would have less sugar remaining. However, the trend shows little change in residual sugar as alcohol increases. This leads me to question how much sugar is in the various wines prior to fermentation. Could wines with higher alcohol content start with higher amounts of sugar? Is there a formula to follow to achieve a desired alcohol level which leads to similar amounts of residual sugar?
The strongest correlation I found exists between fixed acidity and pH (≈ -0.68).
r <- cor(wineData$citric.acid, wineData$volatile.acidity)
rSquare <- r^2
r
## [1] -0.5524957
rSquare
## [1] 0.3052515
I created a scatterplot to display the relationship between citric acid, volatile acidity, and quality. Since the correlation coefficient r for citric acid and volatile acidity is approximately -0.55, the coefficient of determination r\(^2\) is approximately 0.31. An r-squared of 0.31 means that 31\(\%\) of the variation in volatile acidity is explained by citric acid in a simple linear model.
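The "in this model" part of that interpretation can be made explicit by fitting the linear model. As a minimal sketch on made-up vectors, the r-squared reported by summary(lm()) for a one-predictor model matches the squared correlation coefficient:

```r
# Toy vectors (hypothetical values) standing in for two wineData columns.
x <- c(0.00, 0.04, 0.06, 0.26, 0.36, 0.42, 0.56)
y <- c(0.70, 0.76, 0.60, 0.52, 0.50, 0.39, 0.28)

r <- cor(x, y)                       # correlation coefficient
fit <- lm(y ~ x)                     # simple linear regression of y on x
rSquare <- summary(fit)$r.squared    # coefficient of determination

# For a one-predictor linear model, R-squared equals r squared.
all.equal(rSquare, r^2)
```

The equality holds only for a straight-line fit with a single predictor; with several predictors, r-squared has to come from the model itself.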
ggplot(data = wineData, mapping = aes(x=fixed.acidity, y=pH, color=factor(quality))) +
geom_point(alpha = 0.5, size = 1) +
geom_smooth(method = "lm") +
scale_color_brewer(type = "seq")
r_2 <- cor(wineData$fixed.acidity, wineData$pH)
rSquare_2 <- r_2^2
r_2
## [1] -0.6829782
rSquare_2
## [1] 0.4664592
I created a scatter plot to look at the relationship between fixed acidity & pH, and colored the dots by quality. Since the correlation coefficient r for fixed acidity and pH is approximately -0.68, the coefficient of determination r\(^2\) is approximately 0.47. An r-squared of 0.47 means that 47\(\%\) of the variation in pH is explained by fixed acidity in a simple linear model.
ggplot(data = wineData, mapping = aes(x=alcohol, y=density)) +
geom_point(mapping = aes(color= quality), position = "jitter", alpha = .4, shape = 16, size = 5) +
guides(colour = guide_legend()) +
scale_color_gradient(low="blue", high="orange", trans = 'reverse') +
geom_smooth(method="lm")
I had to set low to blue and high to orange in scale_color_gradient() to get my intended result of 3 shown in orange and 8 shown in blue. This looked odd at first, but setting trans to reverse flips the whole scale: the order of the values reversed, and the colors assigned to them reversed as well.
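One possible alternative, sketched here on a made-up data frame, is to leave the scale in its natural direction, assign the colors directly, and reverse only the legend order. Whether this matches the intended look of the original figure is an assumption.

```r
library(ggplot2)

# Toy data (hypothetical values) standing in for wineData.
toy <- data.frame(
  alcohol = c(9.4, 9.8, 10.5, 11.2, 12.0, 13.1),
  density = c(0.9978, 0.9968, 0.9962, 0.9955, 0.9950, 0.9940),
  quality = c(3L, 4L, 5L, 6L, 7L, 8L)
)

# Without a reversed transformation, low colours the smallest value (3)
# and high colours the largest value (8).
recoloured <- ggplot(data = toy, mapping = aes(x = alcohol, y = density)) +
  geom_point(mapping = aes(color = quality), alpha = .4, shape = 16, size = 5) +
  scale_color_gradient(low = "orange", high = "blue") +
  guides(colour = guide_legend(reverse = TRUE))   # only the legend order flips
```

This keeps the color assignment readable in the code: the color named low belongs to the lowest rating.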
set.seed(1599)
wine_subset <- wineData[c(8,11,12)]
names(wine_subset)
## [1] "density" "alcohol" "quality"
ggpairs(wine_subset[sample.int(nrow(wine_subset), 1000),])
r_3 <- cor(wineData$alcohol, wineData$density)
rSquare_3 <- r_3^2
r_3
## [1] -0.4961798
rSquare_3
## [1] 0.2461944
The two visualizations above show the relationship between alcohol, density, and quality. The scatter plot provided a helpful visualization, while ggpairs() provided a matrix of plots together with their correlation coefficients.
The scatterplot featuring citric acid, volatile acidity, and quality shows that as citric acid increases, volatile acidity tends to decrease, and quality tends to increase. The scatterplot featuring fixed acidity, pH, and quality shows that as fixed acidity increases, pH and quality tend to decrease.
The graphs featuring alcohol, density, and quality show that as alcohol content increases, density tends to decrease, and quality tends to increase. There is a weak correlation between density and quality (≈ -0.159), while there are moderate correlations between alcohol & density (≈ -0.489) and between alcohol & quality (≈ 0.472). These coefficients come from ggpairs() on the 1,000-row sample, so they differ slightly from the full-data values reported earlier.
I had trouble finding a simple function that calculates r-squared. What I really wanted to do was draw a regression line on a scatterplot and place the r-squared value on the graph.
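One way this could be done, sketched on a made-up data frame, is to fit the line with lm(), read r-squared from summary(), and add it with annotate(). The label position and rounding are arbitrary choices for illustration.

```r
library(ggplot2)

# Toy data (hypothetical values) standing in for wineData.
toy <- data.frame(
  alcohol = c(9.4, 9.8, 10.0, 10.5, 11.1, 12.0, 13.0),
  density = c(0.9978, 0.9968, 0.9970, 0.9962, 0.9956, 0.9951, 0.9940)
)

# R-squared of the same straight-line fit that geom_smooth(method = "lm") draws.
fit <- lm(density ~ alcohol, data = toy)
rSquare <- summary(fit)$r.squared

labelled <- ggplot(data = toy, mapping = aes(x = alcohol, y = density)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  annotate("text",
           x = max(toy$alcohol), y = max(toy$density),
           hjust = 1, vjust = 1,
           label = paste("R-squared =", round(rSquare, 2)))
```

Because the annotation is computed from the same formula as the smoothing line, the number on the graph always describes the line that is drawn.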
I found it interesting that while some items increased others decreased (e.g. alcohol, density, and quality). I found some additional anomalies that are noted in the final plot section under plot three.
ggplot(data = wineData, mapping = aes(x=residual.sugar, color=factor(quality))) +
geom_histogram(binwidth = 0.2) +
xlim(1,7) +
xlab("Residual Sugar") +
ylab("Count")
## Warning: Removed 31 rows containing non-finite values (stat_bin).
summary(wineData$residual.sugar)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
As stated in the univariate plot section, this histogram is skewed right. The residual sugar in most red wines is approximately 2.0 \(g/dm^3\). The mean (2.539) is above the median (2.2), pulled upward by the right tail. I used xlim() to leave outliers out of the plot and create a cleaner visual. This time, I also grouped the data by quality. Since quality followed an approximately normal distribution in the univariate plot section, we can see that the majority of wines received a quality rating of 5 or 6. The color coding in this graph helps to uncover this.
ggplot(data = wineData, mapping = aes(x=alcohol)) +
geom_point(aes(y=pH,color="pH")) +
geom_point(aes(y=(density-.96)*100,color="Density")) +
geom_smooth(aes(y=pH), color="#33CCCC", method = "lm") +
geom_smooth(aes(y=(density-.96)*100), color="#FF6666", method = "lm") +
# The primary axis is on the pH scale; the secondary axis converts back to density
scale_y_continuous(name="pH",sec.axis = sec_axis(~((./100)+.96), name="Density"))
This graph displays two different relationships: an inverse relationship between alcohol & density, and a direct relationship between alcohol & pH. As alcohol increases, density decreases and pH increases. This suggests that as pH increases, density decreases, which is also verified in the simple scatterplot matrix found in the bivariate plots section.
Throughout this entire project, I had the most fun working on this graph. The reason is that I plotted two variables (pH and density) against separate vertical axes, which was not as simple as when working with software like Excel. I had to use the scale_y_continuous() function with sec_axis(), which required a specific formula (i.e. ~((./100)+.96)) to scale the values for the second axis to a desired set of numbers.
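The formula passed to sec_axis() has to undo the rescaling applied to density inside aes(). The sketch below writes both directions as plain functions; the names to_plot_scale() and to_density() are hypothetical and used only for illustration.

```r
# Forward transform applied to density before it is drawn on the shared scale.
to_plot_scale <- function(density) (density - 0.96) * 100

# Inverse transform, as in sec_axis(), to label the second axis in density units.
to_density <- function(plotted) (plotted / 100) + 0.96

density_values <- c(0.9901, 0.9968, 1.0037)   # min, median, max from summary() above
plotted <- to_plot_scale(density_values)

print(plotted)
all.equal(to_density(plotted), density_values)   # the round trip recovers the input
```

If the two functions were not inverses of each other, the tick labels on the second axis would no longer match the plotted points.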
# cut() bins alcohol (not quality) into 1%-wide intervals
wineData$alcoholBins <- cut(wineData$alcohol,
c(8:16))
ggplot(wineData, mapping = aes(x = factor(round(alcohol)), y = density)) +
geom_boxplot(aes(fill = alcoholBins))
As stated in the multivariate plots section, as alcohol content increases, density tends to decrease, and quality tends to increase. I wanted to look at this data using a different type of graph and chose a box plot. I noticed there are multiple dots below and above the ends of the whiskers, which I had never observed before. I found out that these additional dots are potential outliers, and that outliers should be carefully considered before being removed \(^4\). By the usual rule, outliers lie more than 1.5 x IQR (interquartile range) below the 1st quartile or above the 3rd quartile. Running IQR() on density provides one value for the entire column, and I could easily multiply this by 1.5. However, I believe I need to calculate a separate IQR for each box, since I grouped the data by rounded alcohol for this visualization. This is something I would like to research further.
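A per-group version of the 1.5 x IQR rule can be sketched in base R. The helper flag_outliers() and the toy vectors below are hypothetical; for the box plot above, values would be density and groups would be the rounded alcohol.

```r
# Flag points lying more than 1.5 x IQR outside the quartiles of their own group.
flag_outliers <- function(values, groups) {
  lower <- ave(values, groups, FUN = function(v) quantile(v, 0.25) - 1.5 * IQR(v))
  upper <- ave(values, groups, FUN = function(v) quantile(v, 0.75) + 1.5 * IQR(v))
  values < lower | values > upper
}

# Toy data (hypothetical values): two groups, one with an extreme point.
values <- c(1, 2, 3, 4, 100, 10, 11, 12, 13, 14)
groups <- rep(c("a", "b"), each = 5)

# One IQR per group, instead of a single IQR for the whole column.
print(tapply(values, groups, IQR))

outlier <- flag_outliers(values, groups)
print(values[outlier])
```

A call such as flag_outliers(wineData$density, round(wineData$alcohol)) would mark candidate outliers box by box, which can then be inspected before deciding whether to remove anything.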
I started this project with the thought that I would be most interested in determining whether or not there was a relationship between alcohol & quality. As I started working on the project, I became most interested in the relationship between fixed acidity and pH. I was able to observe the inverse relationship and directly relate the pH values to common items I typically consume. The first section dealt with simple graphs, which were not very interesting, but as the sections grew more complex, the visualizations became more interesting. One item I found helpful when working with geom_point() was setting position to jitter. This spreads the data points slightly by adding a small amount of random noise, which helps clean up the visuals by reducing overplotting. I feel confident creating graphs and researching the visuals I want to build, but I need to get better at looking for patterns, working with statistical formulas, and understanding and explaining the results.